Recovering Internet Service Sessions from Operating System Failures Motivation and Approach

Authors

  • Florin Sultan
  • Aniruddha Bohra
  • Stephen Smaldone
  • Yufei Pan
  • Pascal Gallard
  • Iulian Neamtiu
  • Liviu Iftode
Abstract

Critical Internet services such as e-commerce, online auctions, and banking run on complex, multi-tier architectures built with commodity (off-the-shelf) machines and operating systems. These stateful services are sensitive to server failures: active client sessions on these servers are lost, although the state associated with them might still be intact in a failed machine’s memory. We developed a recovery approach that exploits hardware and software redundancy in Internet service installations to reuse active clients’ session state after OS failures (http://discolab.rutgers.edu/bda). Our lightweight, application-independent system provides both failure detection and recovery, for use with complex, multi-tier Internet services. The core of the system is the novel Backdoors (BD) architecture, which uses commodity programmable network interface cards (NICs) with specialized firmware and OS extensions to provide remote access to lightweight application and OS state in a machine’s memory without relying on its OS or processors. Using BD, machines in an Internet server cluster can cooperatively observe each other’s health, detect failures, and take over client sessions from failed nodes. In this article, we describe the BD architecture and our OS extensions for monitoring and recovery of service sessions. We have implemented a prototype in the FreeBSD 4.8 kernel, using Myrinet LanaiXP programmable NICs (www.myri.com). The results from our experiments with the Rice University Bidding System (Rubis; http://rubis.objectweb.org), a cluster-based

Similar resources

Nonintrusive Failure Detection and Recovery for Internet Services Using Backdoors

We describe an architecture for nonintrusive failure detection and recovery in a cluster of Internet servers in which nodes mutually monitor their liveness and recover client sessions from failed nodes. The system is based on Backdoors, a novel architectural approach for remote healing of computer systems. Backdoors enables monitoring and recovery/repair of state in a computer system by remote ...

Surviving Internet Catastrophes

In this paper, we propose a new approach for designing distributed systems to survive Internet catastrophes called informed replication, and demonstrate this approach with the design and evaluation of a cooperative backup system called the Phoenix Recovery Service. Informed replication uses a model of correlated failures to exploit software diversity. The key observation that makes our approach...

Recovering from Faulty Device Drivers

Several studies (see Swift et al.’s study of Windows XP in SOSP 2003 and Chou et al.’s study of Linux in SOSP 2001) have attributed a large fraction of operating system failures to device driver flaws. Not only can driver errors cause kernel instability, but these errors can also be exploited for privilege escalation and access to kernel data structures. A search on securityfocus.com shows vul...

Service Continuations: An Operating System Mechanism for Dynamic Migration of Internet Service Sessions

We propose service continuations (SC), an OS mechanism that supports seamless dynamic migration of Internet service sessions between cooperating multi-process servers. Service continuations provide a server application with a simple and easy to use abstraction, and a means to migrate the service state along with the serviced connection. SC supports transparent resumption of service to the clien...

Microreboot - A Technique for Cheap Recovery

A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we separate process recovery from data recovery to enable microrebooting – a fine-grain technique for s...

Publication date: 2005